Fast Pattern-Matching via k-bit Filtering Based Text Decomposition
Abstract
This study explores an alternative way of storing text files so that exact-match queries can be answered faster. We decompose the original file into two parts, a filter and a payload. The filter contains the most informative k bits of each byte, and the remaining bits of each byte are concatenated in order of appearance to form the payload. We refer to this structure as the k-bit filtered format. When an input pattern is searched on the k-bit filtered structure, the same decomposition is applied to the pattern: the k bits from each byte of the pattern form the pattern filter bit sequence, and the rest form the pattern payload. The pattern filter is first scanned over the filter part of the file. At each match position detected in the filter part, the pattern payload is verified against the corresponding location in the payload part of the text. Thus, instead of searching for an m-byte pattern in an n-byte text, k·m bits are first scanned against k·n bits, followed by a verification of (8−k)·m bits at each matching position. Experiments on natural language texts, plain ASCII DNA sequences, and random byte sequences showed that search with the proposed scheme is on average two times faster than the best exact pattern-matching algorithms tested. The highest gain is obtained on plain ASCII DNA sequences. We also developed an effective bitwise pattern-matching algorithm, of possible independent interest, within this study.
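The filter-then-verify idea described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the names `decompose` and `search` are hypothetical, the "most informative" bits are assumed to be given as a fixed bit mask, the filter and payload are kept one value per byte rather than packed into k·n and (8−k)·n bits, and Python's built-in `bytes.find` stands in for the bitwise matching algorithm the abstract mentions.

```python
def decompose(data: bytes, mask: int) -> tuple[bytes, bytes]:
    """Split every byte into filter bits (selected by mask) and payload bits.

    Assumption: `mask` marks the k bit positions treated as most informative.
    For clarity the two parts stay unpacked (one value per input byte).
    """
    inverse = ~mask & 0xFF
    filter_part = bytes(b & mask for b in data)
    payload_part = bytes(b & inverse for b in data)
    return filter_part, payload_part


def search(text_filter: bytes, text_payload: bytes,
           pattern: bytes, mask: int) -> list[int]:
    """Return all start positions of `pattern`, scanning the filter first."""
    m = len(pattern)
    if m == 0:
        return []
    pat_filter, pat_payload = decompose(pattern, mask)
    hits = []
    pos = text_filter.find(pat_filter)
    while pos != -1:
        # Candidate found in the filter part: verify the remaining bits.
        if text_payload[pos:pos + m] == pat_payload:
            hits.append(pos)
        pos = text_filter.find(pat_filter, pos + 1)
    return hits


# Bits 1 and 2 (mask 0b110) happen to distinguish A, C, G and T in ASCII,
# so k = 2 is used here purely as an example choice.
MASK = 0b00000110
text = b"ACGTACGTTACG"
text_filter, text_payload = decompose(text, MASK)
print(search(text_filter, text_payload, b"ACG", MASK))
```

With the mask above, lowercase and uppercase nucleotide letters share the same filter bits, so a text such as `acgACG` produces a filter candidate at position 0 that is rejected only in the payload verification step.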
Similar resources
Inexact Pattern Matching Algorithms via Automata
Pattern matching occurs in various applications, ranging from simple text searching in word processors to identification of common motifs in DNA sequences in computational biology. The problem of exact pattern matching has been well studied, and a number of efficient algorithms exist. However, these exact pattern matching algorithms are of little help when they are applied to finding patterns in ...
Multi-pattern Matching with Wildcards
Multi-pattern matching with wildcards is the problem of finding all occurrences of a set of patterns with wildcards in a text. This problem arises in various fields, such as computational biology and network security. However, it has not been studied as extensively as the single-pattern case, and there is no efficient algorithm for it. In this paper, we present efficient algorithms based on the fast F...
Approximate Pattern Matching Over the Burrows-Wheeler Transformed Text
The compressed pattern matching problem is to locate the occurrence(s) of a pattern P in a text string T using a compressed representation of T, with minimal (or no) decompression. In this paper, we consider approximate pattern matching directly on Burrows-Wheeler transformed (BWT) text, which is a critical step for a fully compressed pattern matching algorithm on a BWT based compression algorit...
Bit-Parallel Approximate String Matching Algorithms with Transposition
Using bit-parallelism has resulted in fast and practical algorithms for approximate string matching under the Levenshtein edit distance, which permits a single edit operation to insert, delete or substitute a character. Depending on the parameters of the search, currently the fastest non-filtering algorithms in practice are the O(kn⌈m/w⌉) algorithm of Wu & Manber, the O(⌈km/w⌉n) algorithm of Ba...
Empirical Mode Decomposition based Adaptive Filtering for Orthogonal Frequency Division Multiplexing Channel Estimation
This paper presents an empirical mode decomposition (EMD) based adaptive filter (AF) for channel estimation in OFDM systems. In this method, the length of the channel impulse response (CIR) is first approximated using the Akaike information criterion (AIC). Then, the CIR is estimated using an adaptive filter with the EMD-decomposed IMF of the received OFDM symbol. The correlation and kurtosis measures are used to sel...
Journal title:
- Comput. J.
Volume 55
Publication date: 2012